So, the point is now how to actually compute a policy that is ideally optimal.
And we have two ways of going about that.
Either we look at the utilities of the individual states and derive the actions from them, or we try to compute a policy as a whole.
Value iteration does the first thing.
The basic idea is that the expected sum of rewards for some sequence of actions is the current reward plus my discount factor times the expected sum of the rewards afterwards.
You can basically derive that simply from the definition of the utility of states.
Where is it?
This thing here.
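Written out, the definition being pointed at is roughly the following; the notation is assumed here, following the usual MDP conventions:

```latex
% Utility of a state: expected discounted sum of rewards from that state on
% (standard MDP definition, notation assumed).
U(s) = E\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} R(S_t) \;\middle|\; S_0 = s \right]
% Pulling the first reward out of the sum gives the recursive form described above:
U(s) = R(s) + \gamma \, E\!\left[\, U(S_1) \;\middle|\; S_0 = s \right]
```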
This leads to this giant Bellman equation which is going to be very important in the
future.
We say the utility of a state s is the reward at that state, plus gamma times the maximum over the actions of the sum, over the possible successor states, of their utility times the transition model.
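Written out in symbols, that is the standard Bellman equation (notation assumed here):

```latex
% Bellman equation for the utility of a state s (standard notation assumed):
% reward at s, plus gamma times the best achievable expected successor utility.
U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')
```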
If you do this, you note that the utility at state (1,1), which is down there, this is a bit confusing, is the reward, which is just the usual cost of staying alive, plus gamma times the maximum of all of those expected successor utilities.
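For the 4x3 grid world this instantiates roughly as follows; the concrete numbers (a living reward of -0.04 and the 0.8/0.1/0.1 noisy-move transition model) are the usual textbook values and are assumed here, not read off the slide:

```latex
% Bellman equation at state (1,1) of the 4x3 grid world.
% The living reward -0.04 and the 0.8/0.1/0.1 transition model are assumed values.
U(1,1) = -0.04 + \gamma \max \begin{cases}
  0.8\,U(1,2) + 0.1\,U(2,1) + 0.1\,U(1,1) & \text{(Up)}\\
  0.9\,U(1,1) + 0.1\,U(1,2)               & \text{(Left)}\\
  0.9\,U(1,1) + 0.1\,U(2,1)               & \text{(Down)}\\
  0.8\,U(2,1) + 0.1\,U(1,2) + 0.1\,U(1,1) & \text{(Right)}
\end{cases}
```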
We can use that to compute the utility for each state which is nice.
The annoying thing is that we have one equation for each state, of course, but there is also this pesky maximum in here, which is annoying because it makes the whole thing nonlinear.
That means we can't use linear algebra, which means we can't just do simple matrix manipulation and all of that.
That makes things a bit difficult to compute, but still, it's a first approximation.
So now how do we use this to get at an algorithm?
Well the basic idea is simple.
You just assign some utility to each node at the beginning and then you use the Bellman
equation to basically update the utilities of the neighboring states.
You use it again to update the utilities of those neighboring states and all of that.
That makes everything locally consistent, in the sense that it makes the utility of each state consistent with the utilities of its neighboring states, and if it does that at each state, it does it everywhere.
That's the point here: if it's everywhere locally consistent, you get global optimality.
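Why local consistency everywhere gives the globally correct utilities is not spelled out here, but the standard argument, assumed rather than stated in the lecture, is that the Bellman update is a contraction:

```latex
% Bellman update B applied to a utility vector U (standard definition assumed):
(B U)(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U(s')
% B is a contraction in the max norm for gamma < 1, so repeated application
% converges to its unique fixed point, which is exactly the true utilities:
\lVert B U - B U' \rVert_{\infty} \;\le\; \gamma \, \lVert U - U' \rVert_{\infty}
```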
The iteration algorithm itself is correspondingly rather straightforward.
You have all of the stuff that you're given anyway.
You start with a vector of utilities, with one entry for each state.
Delta is just kept for finding out when to actually stop.
Now what do you do?
You update the utility for each state according to the Bellman equation, this joint thing.
You take the reward, and you add the discount factor times the maximum over the actions of the expected utilities of all the neighboring states.
Then you just check whether it has changed significantly.
If it has not changed significantly, that means we're close enough.
So this is all just termination, a termination heuristic basically.
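As a minimal sketch of what that loop could look like in code; the MDP interface here (states, actions, P, R) is an assumption for illustration, not the lecture's own code:

```python
def value_iteration(states, actions, P, R, gamma, eps=1e-6):
    """Minimal value iteration sketch (assumed MDP interface, for illustration only).

    P[s][a] is a list of (probability, successor) pairs, R[s] is the reward at s.
    Assumes every state has at least one action (terminal states can self-loop).
    """
    U = {s: 0.0 for s in states}      # assign some utility to every state at the beginning
    while True:
        delta = 0.0                   # largest change in this sweep, used only for termination
        U_new = {}
        for s in states:
            # Bellman update: reward plus discounted best expected successor utility
            U_new[s] = R[s] + gamma * max(
                sum(p * U[s2] for p, s2 in P[s][a]) for a in actions(s)
            )
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:               # stop once nothing changes much anymore
            return U
```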
Now, if you do this, you can compute a full utility for each state in your problem, and once you have the full utility for each state, you can derive a policy from that.
That's the basic idea.
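The step from utilities to a policy could then look like this, again under the same assumed MDP interface as in the sketch above:

```python
def derive_policy(states, actions, P, U):
    """Extract a policy from computed utilities (assumed interface, for illustration)."""
    policy = {}
    for s in states:
        # pick the action whose expected successor utility is maximal
        policy[s] = max(
            actions(s),
            key=lambda a: sum(p * U[s2] for p, s2 in P[s][a]),
        )
    return policy
```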
Now, looking at this graph, something funny happens.
This is exactly for this four-by-three grid here.
So imagine what the algorithm does.